Project Title : Cybersecurity Attack Analysis
About the Dataset :
Incribo's synthetic cyber dataset is a collection of 40,000 records with 25 columns. The data is designed to represent realistic network traffic and security events, making it a useful resource for cybersecurity analysis tasks. Analysts can use the dataset to explore attack types, attack signatures, anomaly scores, and other kinds of cybersecurity data.
Cybersecurity Dataset
| Column Name | Description |
|---|---|
| Timestamp | Date and time of the internet activity |
| Source IP Address | Internet address of the sender |
| Destination IP Address | Internet address of the receiver |
| Source Port | Number used by the sender to send information |
| Destination Port | Number used by the receiver to get information |
| Protocol | Language used by the devices to talk to each other (e.g., chat, email) |
| Packet Length | Size of the information package sent over the internet |
| Packet Type | Kind of information package (e.g., regular message, control message) |
| Traffic Type | Type of internet activity (e.g., browsing websites, sending emails) |
| Payload Data | The actual content sent over the internet |
| Malware Indicators | Signs that something bad (malware) might be trying to sneak in |
| Anomaly Scores | Numbers showing unusual activity compared to normal internet use |
| Alerts/Warnings | Notifications from security systems saying something suspicious might be happening |
| Attack Type | Kind of cyberattack that was done or might be happening (e.g., overwhelming a system with traffic, stealing information) |
| Attack Signature | Unique fingerprint of a known cyberattack |
| Action Taken | What was done to stop the threat |
| Severity Level | How serious the threat was (e.g., not serious, kind of serious, very serious) |
| User Information | Details about the person using the internet |
| Device Information | Details about the computer or phone being used |
| Network Segment | Part of the internet where the activity happened |
| Geo-location Data | Location information based on internet addresses |
| Proxy Information | Details about any relays used to connect to the internet |
| Firewall Logs | Records of what the security system allowed or blocked on the internet |
| IDS/IPS Alerts | Notifications from systems that watch for cyberattacks |
| Log Source | Where the information came from (e.g., security software, router) |
Importing the Required Libraries
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
Load the Dataset
df = pd.read_csv("cybersecurity_attacks.csv")
df.head(5)
| | Timestamp | Source IP Address | Destination IP Address | Source Port | Destination Port | Protocol | Packet Length | Packet Type | Traffic Type | Payload Data | ... | Action Taken | Severity Level | User Information | Device Information | Network Segment | Geo-location Data | Proxy Information | Firewall Logs | IDS/IPS Alerts | Log Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-05-30 06:33:58 | 103.216.15.12 | 84.9.164.252 | 31225 | 17616 | ICMP | 503 | Data | HTTP | Qui natus odio asperiores nam. Optio nobis ius... | ... | Logged | Low | Reyansh Dugal | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment A | Jamshedpur, Sikkim | 150.9.97.135 | Log Data | NaN | Server |
| 1 | 2020-08-26 07:08:30 | 78.199.217.198 | 66.191.137.154 | 17245 | 48166 | ICMP | 1174 | Data | HTTP | Aperiam quos modi officiis veritatis rem. Omni... | ... | Blocked | Low | Sumer Rana | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment B | Bilaspur, Nagaland | NaN | Log Data | NaN | Firewall |
| 2 | 2022-11-13 08:23:25 | 63.79.210.48 | 198.219.82.17 | 16811 | 53600 | UDP | 306 | Control | HTTP | Perferendis sapiente vitae soluta. Hic delectu... | ... | Ignored | Low | Himmat Karpe | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... | Segment C | Bokaro, Rajasthan | 114.133.48.179 | Log Data | Alert Data | Firewall |
| 3 | 2023-07-02 10:38:46 | 163.42.196.10 | 101.228.192.255 | 20018 | 32534 | UDP | 385 | Data | HTTP | Totam maxime beatae expedita explicabo porro l... | ... | Blocked | Medium | Fateh Kibe | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ... | Segment B | Jaunpur, Rajasthan | NaN | NaN | Alert Data | Firewall |
| 4 | 2023-07-16 13:11:07 | 71.166.185.76 | 189.243.174.238 | 6131 | 26646 | TCP | 1462 | Data | DNS | Odit nesciunt dolorem nisi iste iusto. Animi v... | ... | Blocked | Low | Dhanush Chad | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | Segment C | Anantapur, Tripura | 149.6.110.119 | NaN | Alert Data | Firewall |
5 rows × 25 columns
Exploratory Data Analysis
# List Columns
df.columns
Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
'Anomaly Scores', 'Alerts/Warnings', 'Attack Type', 'Attack Signature',
'Action Taken', 'Severity Level', 'User Information',
'Device Information', 'Network Segment', 'Geo-location Data',
'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'],
dtype='object')
# Shape of data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the Cybersecurity dataset")
There are 40000 rows and 25 columns in the Cybersecurity dataset
# Dataset Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Timestamp               40000 non-null  object
 1   Source IP Address       40000 non-null  object
 2   Destination IP Address  40000 non-null  object
 3   Source Port             40000 non-null  int64
 4   Destination Port        40000 non-null  int64
 5   Protocol                40000 non-null  object
 6   Packet Length           40000 non-null  int64
 7   Packet Type             40000 non-null  object
 8   Traffic Type            40000 non-null  object
 9   Payload Data            40000 non-null  object
 10  Malware Indicators      20000 non-null  object
 11  Anomaly Scores          40000 non-null  float64
 12  Alerts/Warnings         19933 non-null  object
 13  Attack Type             40000 non-null  object
 14  Attack Signature        40000 non-null  object
 15  Action Taken            40000 non-null  object
 16  Severity Level          40000 non-null  object
 17  User Information        40000 non-null  object
 18  Device Information      40000 non-null  object
 19  Network Segment         40000 non-null  object
 20  Geo-location Data       40000 non-null  object
 21  Proxy Information       20149 non-null  object
 22  Firewall Logs           20039 non-null  object
 23  IDS/IPS Alerts          19950 non-null  object
 24  Log Source              40000 non-null  object
dtypes: float64(1), int64(3), object(21)
memory usage: 7.6+ MB
Examining the Null and Missing Values
Let's check for missing data! Understanding null and missing values is important for accurate analysis.
df.isnull().sum().sort_values(ascending=False)
Alerts/Warnings           20067
IDS/IPS Alerts            20050
Malware Indicators        20000
Firewall Logs             19961
Proxy Information         19851
Attack Type                   0
Geo-location Data             0
Network Segment               0
Device Information            0
User Information              0
Severity Level                0
Action Taken                  0
Attack Signature              0
Timestamp                     0
Source IP Address             0
Anomaly Scores                0
Payload Data                  0
Traffic Type                  0
Packet Type                   0
Packet Length                 0
Protocol                      0
Destination Port              0
Source Port                   0
Destination IP Address        0
Log Source                    0
dtype: int64
# Missing Value by Percentage
df.isnull().sum() / len(df) * 100
Timestamp                  0.0000
Source IP Address          0.0000
Destination IP Address     0.0000
Source Port                0.0000
Destination Port           0.0000
Protocol                   0.0000
Packet Length              0.0000
Packet Type                0.0000
Traffic Type               0.0000
Payload Data               0.0000
Malware Indicators        50.0000
Anomaly Scores             0.0000
Alerts/Warnings           50.1675
Attack Type                0.0000
Attack Signature           0.0000
Action Taken               0.0000
Severity Level             0.0000
User Information           0.0000
Device Information         0.0000
Network Segment            0.0000
Geo-location Data          0.0000
Proxy Information         49.6275
Firewall Logs             49.9025
IDS/IPS Alerts            50.1250
Log Source                 0.0000
dtype: float64
Missing Values in Cybersecurity Dataset
We've identified significant missing values in several columns of our cybersecurity dataset, which contains 40,000 rows and 25 columns. Here's a breakdown of the affected columns:
- Alerts/Warnings: 50.17% (20,067 missing values)
- IDS/IPS Alerts: 50.13% (20,050 missing values)
- Malware Indicators: 50.00% (20,000 missing values)
- Firewall Logs: 49.90% (19,961 missing values)
- Proxy Information: 49.63% (19,851 missing values)
Roughly half of the values in each of these columns are missing, which could affect our analysis. We'll address these missing values in the next steps to keep the results reliable.
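The counts and percentages above can also be combined into a single summary table. A minimal sketch, assuming a small hypothetical DataFrame (toy) in place of the real dataset; the helper name missing_summary is illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy data standing in for the cybersecurity dataset
toy = pd.DataFrame({
    "Alerts/Warnings": ["Alert Triggered", np.nan, np.nan, "Alert Triggered"],
    "Proxy Information": ["150.9.97.135", np.nan, "114.133.48.179", "149.6.110.119"],
    "Protocol": ["ICMP", "ICMP", "UDP", "TCP"],
})
summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(df) on the full dataset should list the same five columns as the breakdown above.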
First, let's address the missing values.
Missing values in the cybersecurity dataset may cause errors in subsequent analysis. Having identified the affected columns, we now replace each missing value with an explicit label.
# Flag whether an alert was triggered: 'yes' if the value is 'Alert Triggered', otherwise 'no'
df['Alerts/Warnings'] = df['Alerts/Warnings'].apply(lambda x: 'yes' if x == 'Alert Triggered' else 'no')
# If Malware Indicators is missing, label it 'No Detection'; otherwise keep the existing value
df['Malware Indicators'] = df['Malware Indicators'].apply(lambda x: 'No Detection' if pd.isna(x) else x)
# If Proxy Information is missing, assume that no proxy was used
df['Proxy Information'] = df['Proxy Information'].apply(lambda x: 'No proxy' if pd.isna(x) else x)
# If Firewall Logs is missing, assume that there is no log data
df['Firewall Logs'] = df['Firewall Logs'].apply(lambda x: 'No Data' if pd.isna(x) else x)
# If IDS/IPS Alerts is missing, label it 'No Data' (no alert was generated by the IDS/IPS)
df['IDS/IPS Alerts'] = df['IDS/IPS Alerts'].apply(lambda x: 'No Data' if pd.isna(x) else x)
df.isnull().sum().sort_values(ascending=False)
Timestamp                 0
Attack Type               0
IDS/IPS Alerts            0
Firewall Logs             0
Proxy Information         0
Geo-location Data         0
Network Segment           0
Device Information        0
User Information          0
Severity Level            0
Action Taken              0
Attack Signature          0
Alerts/Warnings           0
Source IP Address         0
Anomaly Scores            0
Malware Indicators        0
Payload Data              0
Traffic Type              0
Packet Type               0
Packet Length             0
Protocol                  0
Destination Port          0
Source Port               0
Destination IP Address    0
Log Source                0
dtype: int64
All missing values have now been replaced with explicit labels.
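The apply-based fills above could also be written with vectorized pandas calls. The sketch below assumes a hypothetical three-row frame (toy) and reuses the same placeholder labels; it is an alternative formulation, not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the same columns as the real dataset
toy = pd.DataFrame({
    "Alerts/Warnings": ["Alert Triggered", np.nan, "Alert Triggered"],
    "Malware Indicators": ["IoC Detected", np.nan, np.nan],
    "Proxy Information": [np.nan, "150.9.97.135", np.nan],
})

# Binary flag: 'yes' only where an alert was recorded
toy["Alerts/Warnings"] = np.where(
    toy["Alerts/Warnings"].eq("Alert Triggered"), "yes", "no"
)

# Replace missing values with explicit placeholder labels, column by column
toy = toy.fillna({
    "Malware Indicators": "No Detection",
    "Proxy Information": "No proxy",
})
print(toy)
```

Passing a dictionary to fillna keeps all placeholder labels in one place, which may be easier to audit than one apply call per column.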
Explore the Device Information Column
df['Device Information'].value_counts()
Device Information
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 6.2; Trident/3.0) 35
Mozilla/5.0 (compatible; MSIE 5.0; Windows 98; Trident/4.1) 34
Mozilla/5.0 (compatible; MSIE 6.0; Windows CE; Trident/4.0) 33
Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/3.0) 31
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.2; Trident/4.1) 31
..
Mozilla/5.0 (Macintosh; PPC Mac OS X 10_9_2; rv:1.9.2.20) Gecko/6474-09-17 07:53:12 Firefox/3.6.9 1
Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/535.0 (KHTML, like Gecko) CriOS/19.0.850.0 Mobile/88P921 Safari/535.0 1
Mozilla/5.0 (Windows NT 5.0; km-KH; rv:1.9.2.20) Gecko/7799-03-13 07:30:55 Firefox/3.8 1
Mozilla/5.0 (X11; Linux i686; rv:1.9.7.20) Gecko/6248-04-01 13:49:59 Firefox/3.8 1
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_0 like Mac OS X; tg-TJ) AppleWebKit/534.33.5 (KHTML, like Gecko) Version/4.0.5 Mobile/8B116 Safari/6534.33.5 1
Name: count, Length: 32104, dtype: int64
# Extract the leading user-agent token (the text before the first '/') into a new Browser column
df['Browser'] = df['Device Information'].str.split('/').str[0]
# Inspect the new Browser column
df['Browser']
0 Mozilla
1 Mozilla
2 Mozilla
3 Mozilla
4 Mozilla
...
39995 Mozilla
39996 Mozilla
39997 Mozilla
39998 Mozilla
39999 Mozilla
Name: Browser, Length: 40000, dtype: object
import re
# OS and device patterns to search for (checked in order; the first match wins)
patterns = [
    r'Windows',
    r'Linux',
    r'Android',
    r'iPad',
    r'iPod',
    r'iPhone',
    r'Macintosh',
]
def extract_device_or_os(user_agent):
    for pattern in patterns:
        match = re.search(pattern, user_agent, re.I)  # re.I makes the search case-insensitive
        if match:
            return match.group()
    return 'Unknown'  # Return 'Unknown' if no patterns match
# Extract device or OS
df['Device/OS'] = df['Device Information'].apply(extract_device_or_os)
df['Browser'].value_counts()
Browser
Mozilla    31951
Opera       8049
Name: count, dtype: int64
The Browser column contains 31,951 user agents that begin with Mozilla and 8,049 that begin with Opera. Note that most browsers start their user-agent string with the Mozilla/5.0 token, so this split reflects the leading token rather than the actual browser in use.
df['Device/OS'].value_counts()
Device/OS
Windows      17953
Linux         8840
Macintosh     5813
iPod          2656
Android       1620
iPhone        1567
iPad          1551
Name: count, dtype: int64
With 17,953 instances, Windows is the most common operating system, followed by Linux (8,840 instances) and Macintosh (5,813 instances). Mobile devices appear far less often: iPod (2,656), Android (1,620), iPhone (1,567), and iPad (1,551).
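Because extract_device_or_os returns the first pattern that matches, the order of the patterns list affects the result. The sketch below reproduces that first-match logic on two hypothetical user-agent strings (made up for illustration, not taken from the dataset) to show that a string containing both Linux and Android is labeled Linux under the list order used above:

```python
import re

patterns = [r'Windows', r'Linux', r'Android', r'iPad', r'iPod', r'iPhone', r'Macintosh']

def first_match(user_agent, ordered_patterns):
    """Return the first pattern (in list order) found in the user agent."""
    for pattern in ordered_patterns:
        match = re.search(pattern, user_agent, re.I)
        if match:
            return match.group()
    return 'Unknown'

# Hypothetical user-agent strings, not taken from the dataset
ua_android = 'Mozilla/5.0 (Linux; Android 4.4.2) AppleWebKit/535.0 Safari/535.0'
ua_iphone = 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) Safari/535.0'

label_android = first_match(ua_android, patterns)   # 'Linux' is checked before 'Android'
label_iphone = first_match(ua_iphone, patterns)
print(label_android, label_iphone)

# Listing the more specific pattern first changes the label
android_first = [r'Android'] + [p for p in patterns if p != r'Android']
label_reordered = first_match(ua_android, android_first)
print(label_reordered)
```

If Android devices should be counted separately from desktop Linux, placing the Android pattern ahead of Linux is one option; doing so would change the counts reported above.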
#Dropping the Device Information Column
df = df.drop('Device Information', axis = 1)
def extract_time_features(df, Timestamp):
    # Convert timestamp column to datetime if it's not already
    df[Timestamp] = pd.to_datetime(df[Timestamp])
    # Extract time features
    df['Year'] = df[Timestamp].dt.year
    df['Month'] = df[Timestamp].dt.month
    df['Day'] = df[Timestamp].dt.day
    df['Hour'] = df[Timestamp].dt.hour
    df['Minute'] = df[Timestamp].dt.minute
    df['Second'] = df[Timestamp].dt.second
    df['DayOfWeek'] = df[Timestamp].dt.dayofweek  # Monday=0 ... Sunday=6
    return df
# Assuming df is your DataFrame
# The function adds the columns to df in place and returns it, so new_df refers to the same DataFrame
new_df = extract_time_features(df, 'Timestamp')
# Check that the new columns were created
print(new_df.head())
Timestamp Source IP Address Destination IP Address Source Port
0 2023-05-30 06:33:58 103.216.15.12 84.9.164.252 31225 \
1 2020-08-26 07:08:30 78.199.217.198 66.191.137.154 17245
2 2022-11-13 08:23:25 63.79.210.48 198.219.82.17 16811
3 2023-07-02 10:38:46 163.42.196.10 101.228.192.255 20018
4 2023-07-16 13:11:07 71.166.185.76 189.243.174.238 6131
Destination Port Protocol Packet Length Packet Type Traffic Type
0 17616 ICMP 503 Data HTTP \
1 48166 ICMP 1174 Data HTTP
2 53600 UDP 306 Control HTTP
3 32534 UDP 385 Data HTTP
4 26646 TCP 1462 Data DNS
Payload Data ... Log Source Browser
0 Qui natus odio asperiores nam. Optio nobis ius... ... Server Mozilla \
1 Aperiam quos modi officiis veritatis rem. Omni... ... Firewall Mozilla
2 Perferendis sapiente vitae soluta. Hic delectu... ... Firewall Mozilla
3 Totam maxime beatae expedita explicabo porro l... ... Firewall Mozilla
4 Odit nesciunt dolorem nisi iste iusto. Animi v... ... Firewall Mozilla
Device/OS Year Month Day Hour Minute Second DayOfWeek
0 Windows 2023 5 30 6 33 58 1
1 Windows 2020 8 26 7 8 30 2
2 Windows 2022 11 13 8 23 25 6
3 Macintosh 2023 7 2 10 38 46 6
4 Windows 2023 7 16 13 11 7 6
[5 rows x 33 columns]
df.head(5)
| | Timestamp | Source IP Address | Destination IP Address | Source Port | Destination Port | Protocol | Packet Length | Packet Type | Traffic Type | Payload Data | ... | Log Source | Browser | Device/OS | Year | Month | Day | Hour | Minute | Second | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-05-30 06:33:58 | 103.216.15.12 | 84.9.164.252 | 31225 | 17616 | ICMP | 503 | Data | HTTP | Qui natus odio asperiores nam. Optio nobis ius... | ... | Server | Mozilla | Windows | 2023 | 5 | 30 | 6 | 33 | 58 | 1 |
| 1 | 2020-08-26 07:08:30 | 78.199.217.198 | 66.191.137.154 | 17245 | 48166 | ICMP | 1174 | Data | HTTP | Aperiam quos modi officiis veritatis rem. Omni... | ... | Firewall | Mozilla | Windows | 2020 | 8 | 26 | 7 | 8 | 30 | 2 |
| 2 | 2022-11-13 08:23:25 | 63.79.210.48 | 198.219.82.17 | 16811 | 53600 | UDP | 306 | Control | HTTP | Perferendis sapiente vitae soluta. Hic delectu... | ... | Firewall | Mozilla | Windows | 2022 | 11 | 13 | 8 | 23 | 25 | 6 |
| 3 | 2023-07-02 10:38:46 | 163.42.196.10 | 101.228.192.255 | 20018 | 32534 | UDP | 385 | Data | HTTP | Totam maxime beatae expedita explicabo porro l... | ... | Firewall | Mozilla | Macintosh | 2023 | 7 | 2 | 10 | 38 | 46 | 6 |
| 4 | 2023-07-16 13:11:07 | 71.166.185.76 | 189.243.174.238 | 6131 | 26646 | TCP | 1462 | Data | DNS | Odit nesciunt dolorem nisi iste iusto. Animi v... | ... | Firewall | Mozilla | Windows | 2023 | 7 | 16 | 13 | 11 | 7 | 6 |
5 rows × 33 columns
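The DayOfWeek values follow the pandas convention in which Monday is 0 and Sunday is 6. A minimal sketch using the first three timestamps shown in the table above:

```python
import pandas as pd

# First three timestamps from the table above
stamps = pd.to_datetime(pd.Series([
    "2023-05-30 06:33:58",
    "2020-08-26 07:08:30",
    "2022-11-13 08:23:25",
]))

features = pd.DataFrame({
    "DayOfWeek": stamps.dt.dayofweek,   # Monday=0 ... Sunday=6
    "DayName": stamps.dt.day_name(),    # human-readable weekday
    "Hour": stamps.dt.hour,
})
print(features)
```

Adding a day-name column like this can make weekday charts easier to read than the numeric codes.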
df.describe(include = 'object')
| | Source IP Address | Destination IP Address | Protocol | Packet Type | Traffic Type | Payload Data | Malware Indicators | Alerts/Warnings | Attack Type | Attack Signature | ... | Severity Level | User Information | Network Segment | Geo-location Data | Proxy Information | Firewall Logs | IDS/IPS Alerts | Log Source | Browser | Device/OS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | ... | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 |
| unique | 40000 | 40000 | 3 | 2 | 3 | 40000 | 2 | 2 | 3 | 2 | ... | 3 | 32389 | 3 | 8723 | 20149 | 2 | 2 | 2 | 2 | 7 |
| top | 103.216.15.12 | 84.9.164.252 | ICMP | Control | DNS | Qui natus odio asperiores nam. Optio nobis ius... | IoC Detected | no | DDoS | Known Pattern A | ... | Medium | Ishaan Chaudhari | Segment C | Ghaziabad, Meghalaya | No proxy | Log Data | No Data | Firewall | Mozilla | Windows |
| freq | 1 | 1 | 13429 | 20237 | 13376 | 1 | 20000 | 20067 | 13428 | 20076 | ... | 13435 | 6 | 13408 | 16 | 19851 | 20039 | 20050 | 20116 | 31951 | 17953 |
4 rows × 21 columns
df.columns
Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
'Anomaly Scores', 'Alerts/Warnings', 'Attack Type', 'Attack Signature',
'Action Taken', 'Severity Level', 'User Information', 'Network Segment',
'Geo-location Data', 'Proxy Information', 'Firewall Logs',
'IDS/IPS Alerts', 'Log Source', 'Browser', 'Device/OS', 'Year', 'Month',
'Day', 'Hour', 'Minute', 'Second', 'DayOfWeek'],
dtype='object')
Data Visualization
# Plot the Day column with Plotly, split by malware indicator
fig = px.histogram(
    df,
    x='Day',
    color='Malware Indicators',
    title='Number of Malware Attacks by Day',
    # Keys must match the values in the 'Malware Indicators' column
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()
The histogram above shows that the 9th day of the month experienced the highest number of attacks, totaling 720. The chart also shows how the frequency of malware attacks varies from day to day, which may point to patterns in attack timing.
# Month distribution
fig = px.histogram(
    df,
    x='Month',
    title='Month',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Pastel color sequence
)
fig.show()
# Plot the Month column with Plotly, split by malware indicator
fig = px.histogram(
    df,
    x='Month',
    color='Malware Indicators',
    title='Number of Malware Attacks by Month',
    # Keys must match the values in the 'Malware Indicators' column
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()
The graph shows the number of malware attacks in each month. August has the most, with 1,861 incidents, which suggests that risk was higher in August than in the other months.
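Peak values read off a chart can be cross-checked numerically. A sketch under assumptions: the frame below is hypothetical toy data, and 'IoC Detected' is treated as the marker of a malware detection, as in the Malware Indicators column:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Month": [8, 8, 8, 3, 3, 11],
    "Malware Indicators": ["IoC Detected", "IoC Detected", "No Detection",
                           "IoC Detected", "No Detection", "No Detection"],
})

# Count rows per month for each indicator value
by_month = pd.crosstab(toy["Month"], toy["Malware Indicators"])
print(by_month)

# Month with the most detections
peak_month = by_month["IoC Detected"].idxmax()
print(peak_month)
```

Running the same crosstab on df would give the exact per-month counts behind the histogram.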
# Year distribution
fig = px.histogram(
    df,
    x='Year',
    title='Year',
    color_discrete_sequence=px.colors.qualitative.Pastel  # You can specify any color sequence you prefer
)
fig.show()
# Plot the Year column with Plotly, split by malware indicator
fig = px.histogram(
    df,
    x='Year',
    color='Malware Indicators',
    title='Number of Malware Attacks by Year',
    # Keys must match the values in the 'Malware Indicators' column
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()
The histogram shows that most malware attacks occurred from mid-2021 to mid-2022. This period saw the highest frequency of incidents, indicating a notable increase in cyberattacks during that time.
# Protocol distribution with Plotly, split by malware indicator
fig = px.histogram(
    df,
    x='Protocol',
    color='Malware Indicators',
    title='Number of Malware Attacks by Protocol',
    # Keys must match the values in the 'Malware Indicators' column
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()
Network Protocol Descriptions (ICMP, UDP, TCP)
ICMP (Internet Control Message Protocol):
- ICMP sends error messages and operational information about packet processing issues.
- It is used for diagnostics, like ping and traceroute.
- ICMP operates at the network layer (Layer 3) of the OSI model.
UDP (User Datagram Protocol):
- UDP is a simple, connectionless protocol for sending packets with low overhead.
- It operates at the transport layer (Layer 4) and is used for fast, efficient applications like streaming and gaming.
TCP (Transmission Control Protocol):
- TCP is a connection-oriented protocol ensuring reliable, ordered data delivery with error-checking.
- It operates at the transport layer (Layer 4) of the OSI model and is used for critical applications like web browsing and email.
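To make the connectionless nature of UDP concrete, the sketch below sends one datagram over the loopback interface using Python's standard socket module. It assumes loopback networking is available; the payload and variable names are illustrative:

```python
import socket

# UDP: no handshake, each sendto() is an independent datagram
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
receiver.settimeout(2.0)          # avoid blocking forever if nothing arrives
address = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", address)   # no connect() call is needed first

payload, _ = receiver.recvfrom(1024)
print(payload)

sender.close()
receiver.close()
```

A TCP exchange would instead use SOCK_STREAM and require listen(), accept(), and connect() before any data flows, which is the connection setup that gives TCP its reliability guarantees.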
Analyzing the Traffic Type
# Traffic distribution
fig = px.pie(
    df,
    names='Traffic Type',
    title='Traffic Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()
The pie chart reveals a nearly equal distribution of traffic types: DNS (33.4%), HTTP (33.4%), and FTP (33.2%). The three traffic categories appear in roughly equal proportions.
# Traffic Type distribution with Plotly, split by malware indicator
fig = px.histogram(
    df,
    x='Traffic Type',
    color='Malware Indicators',
    title='Number of Malware Attacks by Traffic Type',
    # Keys must match the values in the 'Malware Indicators' column
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()
HTTP (Hypertext Transfer Protocol):
- HTTP is used to transmit web pages and web content over the internet.
- It operates at the application layer (Layer 7) of the OSI model.
- HTTP is stateless, meaning each request-response interaction is independent.
- It is the foundation for web browsing and accessing websites.
DNS (Domain Name System):
- DNS translates domain names (like www.example.com) into IP addresses.
- It operates at the application layer (Layer 7) of the OSI model.
- DNS allows users to access websites using easy-to-remember names instead of numerical IP addresses.
FTP (File Transfer Protocol):
- FTP transfers files between a client and a server on a network.
- It operates at the application layer (Layer 7) of the OSI model.
- Common uses of FTP include uploading website files to a server and sharing files between computers.
Analyzing the Attack Type
# Attack Type distribution
fig = px.pie(
    df,
    names='Attack Type',
    title='Analyzing the Attack Type Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()
The pie chart depicts a nearly equal distribution of attack types: DDoS (33.6%), Malware (33.3%), and Intrusion (33.25%). Each type of attack occurs at a comparable frequency, so the threat landscape is balanced across these three attack types.
# Attack Type distribution with Plotly, split by traffic type
fig = px.histogram(
    df,
    x='Attack Type',
    color='Traffic Type',
    title='Number of Attacks by Attack Type and Traffic Type',
    color_discrete_map={'DNS': 'lightblue', 'HTTP': 'salmon', 'FTP': 'lightgreen'}  # Choose any colors you like
)
fig.show()
Analyzing the Browser, Devices and Attack Types
# Browser distribution
fig = px.pie(
    df,
    names='Browser',
    title='Browser Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()
The "Browser Distribution" pie chart shows that user agents beginning with Mozilla account for 79.9% of records (31,951), while Opera accounts for 20.1% (8,049). Mozilla is therefore by far the more common of the two values in the Browser column.
# Platform distribution
fig = px.pie(
    df,
    names='Device/OS',
    title='Platform Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()
The chart shows the distribution of platforms in the dataset. Windows has the largest share with 44.9%, followed by Linux with 22.1% and Macintosh with 14.5%. The iPod accounts for 6.6%, while Android, iPhone, and iPad each account for roughly 4%.
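The shares in a pie chart can be derived directly from value_counts() output. A short sketch using the Device/OS counts reported earlier in this notebook:

```python
# Device/OS counts as reported by df['Device/OS'].value_counts()
counts = {
    "Windows": 17953, "Linux": 8840, "Macintosh": 5813, "iPod": 2656,
    "Android": 1620, "iPhone": 1567, "iPad": 1551,
}
total = sum(counts.values())  # should equal the 40,000 rows in the dataset

# Percentage share of each platform, rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

With the DataFrame loaded, df['Device/OS'].value_counts(normalize=True) would give the same proportions without the manual division.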
# Platform distribution as a bar chart, split by browser
fig = px.histogram(df, x='Device/OS', color='Browser', title='Platform Distribution')
fig.show()
# Attack type distribution by device/OS
fig = px.histogram(df, x='Device/OS', color='Attack Type', title='Number of Attacks by Device/OS')
fig.show()
# Attack type distribution by browser
fig = px.histogram(df, x='Browser', color='Attack Type', title='Number of Attacks by Browser')
fig.show()
Analyzing the Log Source, Action Taken
# Log Source distribution
fig = px.histogram(df, x='Log Source', title='Log Source')
fig.show()
# Action Taken distribution
fig = px.histogram(
    df,
    x='Action Taken',
    title='Action Taken',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Pastel color sequence
)
fig.show()
# Action Taken by attack type
fig = px.histogram(df, x='Action Taken', color='Attack Type', title='Action Taken by Attack Type')
fig.show()
# Log Source by attack type
fig = px.histogram(df, x='Log Source', color='Attack Type', title='Log Source by Attack Type')
fig.show()
Packet Length Distribution for Various Attack Types
import plotly.graph_objs as go
# Filter data for each attack type
malware_data = df[df['Attack Type'] == 'Malware']['Packet Length']
intrusion_data = df[df['Attack Type'] == 'Intrusion']['Packet Length']
ddos_data = df[df['Attack Type'] == 'DDoS']['Packet Length']
# Create histograms for each attack type
malware_histo = go.Histogram(x=malware_data, name='Malware', opacity=0.7)
intrusion_histo = go.Histogram(x=intrusion_data, name='Intrusion', opacity=0.7)
ddos_histo = go.Histogram(x=ddos_data, name='DDoS', opacity=0.7)
# Create layout
layout = go.Layout(title='Packet Length Distribution for Various Attack Types',
xaxis=dict(title='Packet Length'),
yaxis=dict(title='Frequency'))
# Create figure
fig = go.Figure(data=[malware_histo, intrusion_histo, ddos_histo], layout=layout)
# Show plot
fig.show()
The histogram shows the packet length distributions for Malware, Intrusion, and DDoS attacks. Malware has a definite peak, while intrusion has greater variability, and DDoS has clustered patterns, providing insights into attack characteristics for focused security methods.
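Visual impressions of the distributions can be checked against summary statistics per attack type. A sketch on hypothetical toy values (the rows below are invented for illustration and do not come from the dataset):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "Attack Type": ["Malware", "Malware", "Intrusion", "Intrusion", "DDoS", "DDoS"],
    "Packet Length": [503, 1174, 306, 385, 1462, 64],
})

# Count, mean, and range of packet length for each attack type
stats = toy.groupby("Attack Type")["Packet Length"].agg(["count", "mean", "min", "max"])
print(stats)
```

Applying the same groupby to df would show whether the three attack types really differ in their typical packet length or spread.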
Conclusion
This analysis of a 40,000-record cybersecurity dataset sheds light on significant patterns and vulnerabilities. It shows a diverse mix of browsers and devices, with Windows devices being the most frequent targets of potential cyber threats. The temporal analysis reveals noteworthy patterns, such as higher attack counts on specific days and months, which may indicate vulnerabilities or deliberate tactics by threat actors. Using these findings, organizations can deploy resources strategically, strengthen endpoint security, and improve incident response to bolster their defenses against evolving cyber threats. Businesses can improve the security of their digital infrastructure and data assets by taking proactive steps based on these findings.
Recommendations
The following are key recommendations to strengthen your organization's cybersecurity posture based on the data analysis:
Threat Intelligence Gathering:
- Enhance Real-Time Threat Monitoring: Integrate automated threat intelligence systems to stay on top of the latest cyber threats and vulnerabilities. Collaborate with cybersecurity communities to gain broader insights and develop more effective threat mitigation strategies.
- Contextual Analysis: Analyze threat data alongside external factors like global events or industry trends to anticipate and prepare for potential attacks.
Vulnerability Assessment and Risk Prioritization:
- Regular Assessments: Conduct frequent vulnerability assessments to identify and address security weaknesses before they can be exploited.
- Risk Management: Develop a risk prioritization strategy that considers both the potential impact and likelihood of various threats, allowing for focused and efficient resource allocation.
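One common way to combine impact and likelihood is a simple product score. The sketch below is illustrative only: the 1 to 5 scales and the scores assigned to each threat are hypothetical and are not derived from the dataset:

```python
# Hypothetical threats scored on 1-5 scales for likelihood and impact
threats = [
    {"name": "DDoS", "likelihood": 4, "impact": 3},
    {"name": "Malware", "likelihood": 3, "impact": 5},
    {"name": "Intrusion", "likelihood": 2, "impact": 4},
]

# Risk score = likelihood x impact
for threat in threats:
    threat["risk"] = threat["likelihood"] * threat["impact"]

# Highest risk first, so resources go to the top of the list
ranked = sorted(threats, key=lambda t: t["risk"], reverse=True)
for threat in ranked:
    print(threat["name"], threat["risk"])
```

In practice the scales and weights would come from the organization's own risk assessment rather than fixed numbers like these.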
Security Awareness Training Development:
- Comprehensive Training Programs: Design training programs tailored to different user roles within the organization to improve overall security awareness and response capabilities.
- Phishing Simulations: Implement regular phishing simulations to test and reinforce employee readiness, keeping them alert to potential email-based attacks.
Endpoint Security Solutions Evaluation and Deployment:
- Advanced Endpoint Protection: Deploy robust endpoint security solutions equipped with features like antivirus, anti-malware, and firewalls. Prioritize protecting the most commonly used devices, such as Windows systems.
- Cross-Platform Security: Ensure adequate security measures are also in place for less common devices (e.g., iPads, iPods) to prevent them from becoming attacker entry points.
Cybersecurity Policy Development:
- Clear Policies and Procedures: Develop and document comprehensive cybersecurity policies outlining roles, responsibilities, and standard operating procedures for security incidents.
- Policy Review and Update: Regularly review and update policies to address emerging threats and changes in the business environment.
Incident Response Plan Documentation:
- Detailed Incident Response Plans: Create detailed incident response plans covering detection, containment, eradication, and recovery processes.
- Regular Drills: Conduct regular incident response drills to ensure all team members understand their roles and responsibilities during a cyber incident.
Security Governance Framework Implementation:
- Governance Framework: Implement a robust security governance framework to ensure that cybersecurity efforts are aligned with business objectives and regulatory requirements.
- Continuous Improvement: Establish a governance committee to oversee cybersecurity initiatives, ensuring continuous improvement and adaptation to new threats.